Introduction

We are replicating the result from paper Predicting the Present with Google Trends.

Replicate results from 3.1 Motor vehicles and parts

Replication with authors’ data

Replicate baseline seasonal AR-1 model

We began the replication process by using the data that was used for the paper. According to the paper, the model follows the form of \(y_t = b_1y_{t-1}+b12y_{t-12}+e_t\)(AR-1 model). We used the codes below to obtain the same summary table to the model.

# load the data that was used for the paper
merged <- read_csv("merged.csv")

# take the log of all sales values
merged$sales<-log(merged$sales)

# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = merged, sales~lag(sales, 1)+lag(sales,12))

#the summary of the model
summary(model1)
## 
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = merged)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.209554 -0.034684  0.002482  0.040477  0.220976 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.67266    0.76355   0.881 0.381117    
## lag(sales, 1)   0.64345    0.07332   8.776 3.59e-13 ***
## lag(sales, 12)  0.29565    0.07282   4.060 0.000118 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07985 on 76 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.7185, Adjusted R-squared:  0.7111 
## F-statistic:    97 on 2 and 76 DF,  p-value: < 2.2e-16

The summary we obtained is identical to the paper.

Replicate figure 2

The authors used rolling window forecast to confirm that the Trends variables improve out-of-sample forecasting. What is rolling window forecast? That’s say now is July, 2015. We would use all the actual data that is available to train a model and predict August, 2015. We’ll keep this predicted value aside. Imagine now is August, 2015, we have all the actual data(up to July, 2015) available to train another model to predict September, 2015. Repeat this process for baseline model and the model with trends.

# creating base 
base <- merged # a place holder for predicted sales from baseline model
# rolling window forcast
for (i in 18:91){
  merged_t <- merged[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
  base$sales[i] <- predict(model1,merged[1:i,])[i]
}
base <- base[18:91,]# select out all the predicted sales


# creating trends 
trends <- merged
for (i in 18:91){
  merged_t <- merged[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
  trends$sales[i] <- predict(model1,merged[1:i,])[i]
}
trends <- trends[18:91,]


# Make the graph
# add labels to the data
actual <- merged[18:91,] %>% 
  mutate(label ="actual")
base <- base %>% 
  mutate(label = "base")
trends <- trends %>% 
  mutate(label ="trends")

# means absolute error
mean(abs(trends$sales-actual$sales))
## [1] 0.05667658
mean(abs(base$sales-actual$sales))
## [1] 0.06343984
# data for the recession period Dec 2007 to June 2009
recession_trends <- trends %>% 
  filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base <- base %>% 
  filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual <- actual %>% 
  filter(Period>="2007-12-01"& Period<="2009-06-01")

# means absolute error for the recession period
# MAE for the model with trends
mean(abs(recession_trends$sales-recession_actual$sales))
## [1] 0.06965812
# MAE for the model for baseline model
mean(abs(recession_base$sales-recession_actual$sales))
## [1] 0.08869325
# Overall improvement 
(mean(abs(base$sales-actual$sales))-mean(abs(trends$sales-actual$sales)))/mean(abs(base$sales-actual$sales))
## [1] 0.1066089
# recession improvement
(mean(abs(recession_base$sales-recession_actual$sales))-mean(abs(recession_trends$sales-recession_actual$sales)))/mean(abs(recession_base$sales-recession_actual$sales))
## [1] 0.2146175
# Note: the improvements they stated in the paragraph were different from what they labeled on the graph. First, I thought they used some kind of function, but the MAE() from library(caret) gives same value.


# R^2 for base
(cor(base$sales,actual$sales))^2
## [1] 0.7241049
# R^2 for trends
(cor(trends$sales,actual$sales))^2
## [1] 0.7928897
# combine data for plotting
plot_data <- rbind(actual, base, trends)
ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index')+
  labs(subtitle ="MAE improvement: Overall = 10.66%  During recession = 21.46%" )

Replication with raw data

Where did we find the data?

The sales data: We first tried the link provided in the paper, but the data is adjusted. We were able to find the unadjusted data through this link. However, the values were different from the original data(used by the authors of the paper). The difference increases as year increases. The reason for that is still unclear. We will compare the two data set in the next session.

The search data: Trucks & SUVs Auto Insurance

Comparing sales data used in the paper with data we obtained from census.gov

We noticed that the sales from the data we obtained were always less than the data used in the paper. We want to find out why.

# load the data
sales <- read_excel("sales.xls")
merged <- read_csv("merged.csv")
# Period has type char in the data, convert that to yearmonth 
ym <- as.yearmon(sales$Period, "%b-%Y")
# use as.Date() to convert the type of ym to date
sales$Period <- as.Date(ym)
#keep only data from 01/2004 through 07/2011 and renames the data frame as unadj, rename the column Value as sales,and add label "unadjusted"
unadj<- sales %>% 
  filter(Period <="2011-07-01") %>% 
  mutate(label = "unadjusted") %>% 
  rename(sales = Value)
# convert sales to numeric 
unadj$sales <-as.numeric(unadj$sales)
# add label to original data, keep only Period, sales and label
orig <- merged %>% 
  mutate(label = "original") %>% 
  select(-insurance, -suvs)
# stack two data frame
com_data <- rbind(unadj, orig)

#plot log(sales) Vs Period 
ggplot(com_data, aes(x=Period, y = log(sales), color = label)) +
  geom_line()

# plot orignal against unadjusted
joined_data <- unadj %>% select(-label) %>% rename(unadjusted_sales = sales) %>% left_join(orig, by ="Period") %>% select(-label)

ggplot(joined_data, aes(x = sales, y = unadjusted_sales)) +
  geom_point()+
  geom_abline(linetype = "dashed") +
  xlab('original') +
  ylab('unadjusted')+
  ggtitle("original sales against unadjusted sales")

From the first graph, we can see that the sales values from both original and unajusted are very close to each other(almost overlapping). When I plotted original data against unadjusted data, they almost lined up on the diagonal. Note that original is always less than unadjusted. I wonder if people went back and modified that data after the authors obtained the data. Further investigation is needed. In conclusion, the unadjusted data we found is very close to the original data.

Extensions

Replication with data between 08/01/2011- 12/01/2019

suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>% 
  rename(suvs = names1[2],
         Period = Month)
insurance_trends <- insurance_trends %>% 
  rename(insurance = names2[2], 
         Period = Month)

start_month <- "2011-07-01" # "2004-01-01"
end_month <- "2019-12-01"   # "2020-05-01"

unadj_full <- sales %>% filter(Period <= end_month & Period >start_month)
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m"))) 
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value) 
with_trends_full$sales = log(as.numeric(with_trends_full$sales))

# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12))

#the summary of the model
summary(model1)
## 
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = with_trends_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07213 -0.01791  0.00315  0.02055  0.06457 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.56060    0.28128   5.548 3.14e-07 ***
## lag(sales, 1)   0.01167    0.03899   0.299    0.765    
## lag(sales, 12)  0.85541    0.03484  24.550  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02794 on 86 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9485, Adjusted R-squared:  0.9473 
## F-statistic: 792.4 on 2 and 86 DF,  p-value: < 2.2e-16
# model with trends
model_with_trend <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)

# summary table of the model
summary(model_with_trend)
## 
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12) + suvs + 
##     insurance, data = with_trends_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07066 -0.01718 -0.00007  0.01403  0.07169 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.4104669  0.4317867   0.951  0.34452    
## lag(sales, 1)   0.0374763  0.0377719   0.992  0.32396    
## lag(sales, 12)  0.9351565  0.0405170  23.081  < 2e-16 ***
## suvs           -0.0019539  0.0006383  -3.061  0.00296 ** 
## insurance       0.0020128  0.0010723   1.877  0.06397 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0265 on 84 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9548, Adjusted R-squared:  0.9526 
## F-statistic: 443.2 on 4 and 84 DF,  p-value: < 2.2e-16
# Rolling window 
base <- with_trends_full
len <- nrow(base)
# creating predicted base with rolling window forecast
for (i in 18:len){
  merged_t <- with_trends_full[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
  base$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
base <- base[18:len,]


# creating predicted trends with rolling window forecast
trends <- with_trends_full
for (i in 18:len){
  merged_t <- with_trends_full[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
  trends$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
trends <- trends[18:len,]

# Make the graph
actual <- with_trends_full[18:len,]
actual <- actual %>% 
  mutate(label ="actual")
base <- base %>% 
  mutate(label = "base")
trends <- trends %>% 
  mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))
# R^2 for baseline model 
r_sqr_base <-(cor(base$sales,actual$sales))^2

# R^2 for model with trends

r_sqr_trends <- (cor(trends$sales,actual$sales))^2

We applied the same procedures to data between 2011-07-01 and 2019-12-01. We don’t see much improvement on the fitness of the model. When we applied rolling window forecasting, we got \(R^2\) = 0.9375906 for baseline model, and \(R^2\) = 0.9168464. This is suggesting that adding trends data to the model may decreases prediction accuracy. We also tried this procedure with data from different time period, the results are similar.

Forecasting with new model

We would to like to explore new models for forecasting and nowcasting. We examined how many models. ADD more here At the end, We will try to predict the sales of July, 2020 using these models.

  • We want forecast by using the baseline model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\) and the model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + b_3insurance + b_4SUVs + e_t\) (search data from month t). We’ll use rolling window forecast to evaluate our model(from 01/01/2017 to 01/01/2019) .
## getting data ready
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>% 
  rename(suvs = names1[2],
         Period = Month)

insurance_trends <- insurance_trends %>% 
  rename(insurance = names2[2], 
         Period = Month)
unadj_full <- sales %>% filter(Period <= "2020-05-01")


##Join data
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m"))) 
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value) 
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
#with_trends_full$sales = as.numeric(with_trends_full$sales)


## starting from 2017
whole_data_2017 <- with_trends_full %>% 
  filter(Period >="2013-01-01" & Period <="2019-01-01") # we want to start predict 2017-01-01 when it is 2016-10-01, we need data from at least from 2015-10-01 (We want more data before first prediction to train a good model)
months <- nrow(whole_data_2017)

 
base <- whole_data_2017
start <- 46 # the index for 2016-10-01
# creating predicted base with rolling window nowcast
for (i in start:months){
  merged_t <- whole_data_2017[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 2)+lag(sales,14))
  base$sales[i] <- predict(model1,whole_data_2017[1:i,])[i]
}
base <- base[start:months,]


# creating predicted trends with rolling window nowcast

trends <- whole_data_2017
for (i in start:months){
  merged_t <- whole_data_2017[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 2)+lag(sales,14)+ lag(suvs,2) + lag(insurance,2)) # we used search data from the month when we were making the prediction
  trends$sales[i] <- predict(model1,whole_data_2017[1:i,])[i]
}
trends <- trends[start:months,]

# Make the graph

actual <- whole_data_2017[start:months,]
actual <- actual %>% 
  mutate(label ="actual")
base <- base %>% 
  mutate(label = "base")
trends <- trends %>% 
  mutate(label ="trends")
plot_data <- rbind(actual, base, trends) 
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))
MAE(trends$sales, actual$sales)
## [1] 0.06306789
MAE(base$sales, actual$sales)
## [1] 0.06151782
(MAE(base$sales, actual$sales)-MAE(trends$sales, actual$sales))/MAE(base$sales, actual$sales)
## [1] -0.02519705
#R^2 for base
(cor(base$sales,actual$sales))^2
## [1] 0.01782038
# R^2 for trends
(cor(trends$sales,actual$sales))^2
## [1] 0.01445199

Both models performed terribly. We were curious about the reason. We plotted Period VS sales for data from 01/01/2004 through 05/01/2020.

# yearly trends
ggplotly(ggplot(with_trends_full, aes(x = Period, y = sales)) + geom_line())

From this graph, we observed annual trend. The beginning and the end of a year tend to be the lowest point, mid-year tends to be the highest point, and the overall sales is increase(exceptions during the recession and 2020). This explains why the model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\) performed poorly. In this model, we used data from current month and 12 months ago to predict what would happen two month later. This violates the annual trend the sales obeys.

  • Inspired by the annual trend, we want to test a different model for forecasting. \(y_t=b_1y_{t-12}+b_2y_{t-24}+b_3y_{t-36}+b_4y_{t-48}+b_5insurance + b_6SUVs\) Since there is a strong annual trend, we wonder if use the data from previous years of the same month as a predictor will give us a better model.

we used data from 2012-01-01 through 2018-12-01 to train the model, and make predict what would the sales for the whole year of 2019.

Note: we tried to use different time period for training the model. It appears that use data previous 6 years of data to train the model is best.

# Used data from 2012-01-01 through 2018-12-01 to train the model
training_data <- with_trends_full %>% 
  filter(Period >="2012-01-01"& Period <="2018-12-01")

# I'm using trends data from previous year of the same month for training
model_with_trend_all <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
summary(model_with_trend_all)
## 
## Call:
## lm(formula = sales ~ lag(sales, 12) + lag(sales, 24) + lag(sales, 
##     36) + lag(sales, 48) + lag(insurance, 12) + lag(suvs, 12), 
##     data = training_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.046815 -0.012743 -0.000761  0.014144  0.042985 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         1.0128338  0.6817167   1.486  0.14815   
## lag(sales, 12)      0.5655907  0.1834050   3.084  0.00446 **
## lag(sales, 24)      0.0628445  0.2036901   0.309  0.75988   
## lag(sales, 36)     -0.0818047  0.2022687  -0.404  0.68886   
## lag(sales, 48)      0.3823681  0.1690217   2.262  0.03135 * 
## lag(insurance, 12) -0.0004153  0.0022620  -0.184  0.85562   
## lag(suvs, 12)      -0.0011516  0.0013233  -0.870  0.39128   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02353 on 29 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.9082, Adjusted R-squared:  0.8892 
## F-statistic:  47.8 on 6 and 29 DF,  p-value: 9.827e-14
model_with_trend_2 <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
summary(model_with_trend_2)
## 
## Call:
## lm(formula = sales ~ lag(sales, 12) + lag(sales, 48) + lag(insurance, 
##     12) + lag(suvs, 12), data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04722 -0.01169 -0.00021  0.01387  0.04380 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.0243068  0.6480123   1.581  0.12410    
## lag(sales, 12)      0.5722287  0.1214147   4.713 4.88e-05 ***
## lag(sales, 48)      0.3559760  0.1129101   3.153  0.00358 ** 
## lag(insurance, 12) -0.0006713  0.0020917  -0.321  0.75039    
## lag(suvs, 12)      -0.0010073  0.0012108  -0.832  0.41182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02284 on 31 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.9075, Adjusted R-squared:  0.8956 
## F-statistic: 76.07 on 4 and 31 DF,  p-value: 1.413e-15
model_without_trend <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48))
summary(model_without_trend)
## 
## Call:
## lm(formula = sales ~ lag(sales, 12) + lag(sales, 48), data = training_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.061527 -0.013950  0.000808  0.014460  0.055144 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.7749     0.6770   2.622   0.0131 *  
## lag(sales, 12)   0.7189     0.1272   5.651  2.7e-06 ***
## lag(sales, 48)   0.1306     0.1004   1.301   0.2023    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02573 on 33 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.8751, Adjusted R-squared:  0.8675 
## F-statistic: 115.6 on 2 and 33 DF,  p-value: 1.246e-15
# we want to predict whole year of 2019, so we need data from 2015-01-01 and after 

#Change here to see the graph of that model
model <- model_with_trend_2

# test_data <- with_trends_full %>%
#   filter(Period >="2013-01-01")
# predictions <- test_data %>%
#   add_predictions(model) %>%
#   filter(Period >= "2017-01-01" & Period <= "2018-01-01")
# MSE <- mean((predictions$pred -predictions$sales)^2)


test_data <- with_trends_full %>%
  filter(Period >="2015-01-01" & Period < "2020-01-01")
predictions <- test_data %>%
  add_predictions(model) %>% 
  filter(Period >= "2019-01-01" )


MSE <- mean((predictions$pred -predictions$sales)^2)

ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
  geom_point() +
  geom_abline()+
  xlab("Sales")+
  ylab("Prediction"))
(cor(predictions$sales, predictions$pred))^2
## [1] 0.9379584

We tried three version of this model, we found that model \(y_t=b_1y_{t-12}b_4y_{t-48}+b_5insurance + b_6SUVs\) makes better prediction for year 2019. MSE = 6.166375610^{-4} and \(R^2\) = 0.9379584

  • with model \(y_t = b_1y_{t-1}+b12y_{t-12}+e_t\)(AR-1 model) we observed a much better performance.However, we don’t see any improvement on the model with trends data as pridictors
## getting data ready
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>% 
  rename(suvs = names1[2],
         Period = Month)

insurance_trends <- insurance_trends %>% 
  rename(insurance = names2[2], 
         Period = Month)

unadj_full <- sales %>% filter(Period <= "2020-05-01")

trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m"))) 
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value) 
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
#with_trends_full$sales = as.numeric(with_trends_full$sales)


## starting from 2018
whole_data_2018 <- with_trends_full %>% 
  filter(Period >="2015-01-01") # we want to predict 2018-01-01 when it is 2017-10-01, we need data from 2016-10-01 
#months <- nrow(whole_data_2018)
months <- 60
 
base <- whole_data_2018
start <- 37
# creating predicted base with rolling window nowcast
for (i in start:months){
  merged_t <- whole_data_2018[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
  base$sales[i] <- predict(model1,whole_data_2018[1:i,])[i]
}
base <- base[start:months,]


# creating predicted trends with rolling window nowcast

trends <- whole_data_2018
for (i in start:months){
  merged_t <- whole_data_2018[1:i-1,]
  model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
  trends$sales[i] <- predict(model1,whole_data_2018[1:i,])[i]
}
trends <- trends[start:months,]

# Make the graph

actual <- whole_data_2018[start:months,]
actual <- actual %>% 
  mutate(label ="actual")
base <- base %>% 
  mutate(label = "base")
trends <- trends %>% 
  mutate(label ="trends")
plot_data <- rbind(actual, base, trends) 
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))
MAE(trends$sales, actual$sales)
## [1] 0.02392399
MAE(base$sales, actual$sales)
## [1] 0.02278401
(MAE(base$sales, actual$sales)-MAE(trends$sales, actual$sales))/MAE(base$sales, actual$sales)
## [1] -0.0500341
#R^2 for base
(cor(base$sales,actual$sales))^2
## [1] 0.8951386
# R^2 for trends
(cor(trends$sales,actual$sales))^2
## [1] 0.8936784

Predict June and July 2020 with various models

  1. With model \(y_t = b_1y_{t-1}+b12y_{t-12}+e_t\)(AR-1 model)
data_for_pre_20 <- with_trends_full %>% 
  filter(Period >="2017-10-01"& Period <="2019-12-01")
model1 <- lm(data = data_for_pre_20 , sales~lag(sales, 1)+lag(sales,12))
data_2020 <- with_trends_full %>% filter(Period >="2019-01-01")
summary(model1)
## 
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = data_for_pre_20)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041858 -0.008167 -0.002117  0.010040  0.035588 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.61815    1.14542  -2.286   0.0412 *  
## lag(sales, 1)   0.15053    0.07233   2.081   0.0595 .  
## lag(sales, 12)  1.07968    0.08664  12.462 3.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02108 on 12 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9377, Adjusted R-squared:  0.9273 
## F-statistic: 90.28 on 2 and 12 DF,  p-value: 5.858e-08
predicted <- predict(model1,data_2020)[13:17]
MSE <- mean((predicted-data_2020$sales[13:17])^2)
mse(model1, data_2020)
## [1] 0.0540795
#We will predict June 2020 and then July 2020

n <- tidy(model1)

June <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[6])
June
## [1] 109322.3
July <-  exp(n$estimate[1] +n$estimate[2]*log(June)+n$estimate[3]*data_2020$sales[7])

July
## [1] 115180.4
  1. With model \(y_t = b_1y_{t-1}+b_2y_{t-12}+e_t+b_3insurance+b_4SUVs\)
data_for_pre_20 <- with_trends_full %>% 
  filter(Period >="2017-10-01"& Period <="2019-12-01")
model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 1)+lag(sales,12) + insurance + suvs)
data_2020 <- with_trends_full %>% filter(Period >="2019-01-01")
summary(model_with_trend )
## 
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12) + insurance + 
##     suvs, data = data_for_pre_20)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041737 -0.008079 -0.001873  0.010327  0.035907 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.6407681  1.4175371  -1.863   0.0921 .  
## lag(sales, 1)   0.1501538  0.0821828   1.827   0.0976 .  
## lag(sales, 12)  1.0824753  0.1104340   9.802 1.91e-06 ***
## insurance       0.0001865  0.0032765   0.057   0.9557    
## suvs           -0.0001653  0.0023871  -0.069   0.9462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02308 on 10 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9377, Adjusted R-squared:  0.9128 
## F-statistic: 37.64 on 4 and 10 DF,  p-value: 5.334e-06
predicted <- predict(model_with_trend ,data_2020)[13:17]
MSE <- mean((predicted-data_2020$sales[13:17])^2)
mse(model_with_trend , data_2020)
## [1] 0.05457793
#We will predict July 2020

n <- tidy(model_with_trend )

June <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[6]+n$estimate[4]*data_2020$insurance[17]+n$estimate[5]*data_2020$suvs[17])
June
## [1] 109221.7
  1. With model \(y_t = b_1y_{t-2}+b_2y_{t-14}+e_t+b_3insurance+b_4SUVs\)(AR-1 model) Bad model
data_for_pre_20 <- with_trends_full %>% 
  filter(Period >="2017-10-01"& Period <="2019-12-01")
model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 2)+lag(sales,14) + insurance + suvs)
data_2020 <- with_trends_full %>% filter(Period >="2019-01-01")
summary(model_with_trend )
## 
## Call:
## lm(formula = sales ~ lag(sales, 2) + lag(sales, 14) + insurance + 
##     suvs, data = data_for_pre_20)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.143274 -0.022322 -0.003005  0.036227  0.083648 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     5.262405   4.386827   1.200   0.2646  
## lag(sales, 2)   0.405251   1.185718   0.342   0.7413  
## lag(sales, 14)  0.032898   1.421701   0.023   0.9821  
## insurance      -0.011093   0.012547  -0.884   0.4024  
## suvs            0.019694   0.009837   2.002   0.0803 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0743 on 8 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.431,  Adjusted R-squared:  0.1465 
## F-statistic: 1.515 on 4 and 8 DF,  p-value: 0.2855
predicted <- predict(model_with_trend ,data_2020)[13:17]
MSE <- mean((predicted-data_2020$sales[13:17])^2)
mse(model_with_trend , data_2020)
## [1] 0.006739459
#We will predict July 2020

n <- tidy(model_with_trend )

July <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[5]+n$estimate[4]*data_2020$insurance[17]+n$estimate[5]*data_2020$suvs[17])
July-115180.4
## [1] -7044.656
  1. \(y_t=b_1y_{t-12}+b_2y_{t-24}+b_3y_{t-36}+b_4y_{t-48}+b_5insurance + b_6SUVs\)
data_for_pre_20 <- with_trends_full %>% 
  filter(Period >="2010-01-01"& Period <="2016-12-01")
# model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ insurance + suvs)
model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 12)+lag(sales, 48)+ insurance + suvs)

data_2020 <- with_trends_full %>% filter(Period >="2013-01-01")
summary(model_with_trend )
## 
## Call:
## lm(formula = sales ~ lag(sales, 12) + lag(sales, 48) + insurance + 
##     suvs, data = data_for_pre_20)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.045072 -0.014807  0.000768  0.009903  0.041889 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.4942178  0.4559826   1.084   0.2868    
## lag(sales, 12)  0.5807464  0.0854021   6.800 1.29e-07 ***
## lag(sales, 48)  0.3957707  0.0801927   4.935 2.59e-05 ***
## insurance       0.0016225  0.0015207   1.067   0.2942    
## suvs           -0.0021911  0.0009134  -2.399   0.0226 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01921 on 31 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.9596, Adjusted R-squared:  0.9543 
## F-statistic: 183.9 on 4 and 31 DF,  p-value: < 2.2e-16
predicted <- predict(model_with_trend ,data_2020)[49:89]
MSE <- mean((predicted-data_2020$sales[49:89])^2)
mse(model_with_trend , data_2020)
## [1] 0.01147997
#We will predict July 2020

n <- tidy(model_with_trend )

July <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[5]+n$estimate[4]*data_2020$insurance[17]+n$estimate[5]*data_2020$suvs[17])
July-115180.4
## [1] -8875.708